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In linear regression problems with related predictors, it is desir- 
able to do variable selection and estimation by maintaining the hi- 
erarchical or structural relationships among predictors. In this paper 
we propose non-negative garrote methods that can naturally incor- 
porate such relationships defined through effect heredity principles 
or marginality principles. We show that the methods are very easy 
to compute and enjoy nice theoretical properties. We also show that 
the methods can be easily extended to deal with more general regres- 
sion problems such as generalized linear models. Simulations and real 
examples are used to illustrate the merits of the proposed methods. 

1. Introduction. When considering regression with a large number of 
predictors, variable selection becomes important. Numerous methods have 
been proposed in the literature for the purpose of variable selection, ranging 
from the classical information criteria such as AIC and BIC to regularization 
based modern techniques such as the nonnegative garrote [Breiman (1995)], 
the Lasso [Tibshirani (1996)] and the SCAD [Fan and Li (2001)], among 
many others. Although these methods enjoy excellent performance in many 
applications, they do not take the hierarchical or structural relationship 
among predictors into account and therefore can lead to models that are 
hard to interpret. 

Consider, for example, multiple linear regression with both main effects 
and two-way interactions where a dependent variable Y and q explanatory 
variables X\ , X 2 , • • • , X q are related through 

(1.1) Y = P 1 X 1 + ■■■ + p q X q + faXl + p 12 X 1 X 2 + ■■■ + p qq X 2 q + e, 
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where e ~ M(0, a 2 ). Commonly used general purpose variable selection tech- 
niques, including those mentioned above, do not distinguish interactions 
XiXj from main effects Xi and can select a model with an interaction but 
neither of its main effects, that is, flij ^ and = (3j = 0. It is therefore use- 
ful to invoke the so-called effect heredity principle [Hamada and Wu (1992)] 
in this situation. There are two popular versions of the heredity principle 
[Chipman (1996)]. Under strong heredity, for a two-factor interaction ef- 
fect X{Xj to be active both its parent effects, Xi and Xj, should be active; 
whereas under weak heredity only one of its parent effects needs to be active. 
Likewise, one may also require that Xf can be active only if Xi is also active. 
The strong heredity principle is closely related to the notion of marginality 
[Nelder (1977), McCullagh and Nelder (1989), Nelder (1994)] which ensures 
that the response surface is invariant under scaling and translation of the 
explanatory variables in the model. Interested readers are also referred to 
McCullagh (2002) for a rigorous discussion about what criteria a sensible 
statistical model should obey. Li, Sudarsanam and Prey (2006) recently con- 
ducted a meta-analysis of 113 data sets from published factorial experiments 
and concluded that an overwhelming majority of these real studies conform 
with the heredity principles. This clearly shows the importance of using 
these principles in practice. 

These two heredity concepts can be extended to describe more general 
hierarchical structure among predictors. With slight abuse of notation, write 
a general multiple linear regression as 

(1.2) Y = X(3 + e, 

where X = (X\,X2, X p ) and j3 = (/?i, . . . , f3 p )'. Throughout this paper, we 
center each variable so that the observed mean is zero and, therefore, the re- 
gression equation has no intercept. In its most general form, the hierarchical 
relationship among predictors can be represented by sets \T>i : i = 1, . . . ,p}, 
where T>i contains the parent effects of the ith predictor. For example, the 
dependence set of XiXj is {Xi,Xj} in the quadratic model (1.1). In order 
that the zth variable can be considered for inclusion, all elements of T>i must 
be included under the strong heredity principle, and at least one element 
of T>i should be included under the weak heredity principle. Other types of 
heredity principles, such as the partial heredity principle [Nelder (1998)], can 
also be incorporated in this framework. The readers are referred to Yuan, 
Joseph and Lin (2007) for further details. As pointed out by Turlach (2004), 
it could be very challenging to conform with the hierarchical structure in 
the popular variable selection methods. In this paper we specifically address 
this issue and consider how to effectively impose such hierarchical structures 
among the predictors in variable selection and coefficient estimation, which 
we refer to as structured variable selection and estimation. 
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Despite its great practical importance, structured variable selection and 
estimation has received only scant attention in the literature. Earlier in- 
terests in structured variable selection come from the analysis of designed 
experiments where heredity principles have proven to be powerful tools in 
resolving complex aliasing patterns. Hamada and Wu (1992) introduced a 
modified stepwise variable selection procedure that can enforce effect hered- 
ity principles. Later, Chipman (1996) and Chipman, Hamada and Wu (1997) 
discussed how the effect heredity can be accommodated in the stochas- 
tic search variable selection method developed by George and McCulloch 
(1993). See also Joseph and Delaney (2007) for another Bayesian approach. 
Despite its elegance, the Bayesian approach can be computationally demand- 
ing for large scale problems. Recently, Yuan, Joseph and Lin (2007) proposed 
generalized LARS algorithms [Osborne, Presnell and Turlach (2000), Efron 
et al. (2004)] to incorporate heredity principles into model selection. Efron 
et al. (2004) and Turlach (2004) also considered alternative strategies to en- 
force the strong heredity principle in the LARS algorithm. Compared with 
earlier proposals, the generalized LARS procedures enjoy tremendous com- 
putational advantages, which make them particularly suitable for problems 
of moderate or large dimensions. However, Yuan and Lin (2007) recently 
showed that LARS may not be consistent in variable selection. Moreover, 
the generalized LARS approach is not flexible enough to incorporate many 
of the hierarchical structures among predictors. More recently, Zhao, Rocha 
and Yu (2009) and Choi, Li and Zhu (2006) proposed penalization methods 
to enforce the strong heredity principle in fitting a linear regression model. 
However, it is not clear how to generalize them to handle more general 
heredity structures and their theoretical properties remain unknown. 

In this paper we propose a new framework for structured variable se- 
lection and estimation that complements and improves over the existing 
approaches. We introduce a family of shrinkage estimator that is similar 
in spirit to the nonnegative garrote, which Yuan and Lin (2007) recently 
showed to enjoy great computational advantages, nice asymptotic properties 
and excellent finite sample performance. We propose to incorporate struc- 
tural relationships among predictors as linear inequality constraints on the 
corresponding shrinkage factors. The resulting estimates can be obtained as 
the solution of a quadratic program and very efficiently solved using the 
standard quadratic programming techniques. We show that, unlike LARS, 
it is consistent in both variable selection and estimation provided that the 
true model has such structures. Moreover, the linear inequality constraints 
can be easily modified to adapt to any situation arising in practical problems 
and therefore is much more flexible than the existing approaches. We also 
extend the original nonnegative garrote as well as the proposed structured 
variable selection and estimation methods to deal with the generalized linear 
models. 
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The proposed approach is much more flexible than the generalized LARS 
approach in Yuan, Joseph and Lin (2007). For example, suppose a group of 
variables is expected to follow strong heredity and another group weak hered- 
ity, then in the proposed approach we only need to use the corresponding 
constraints for strong and weak heredity in solving the quadratic program, 
whereas the approach of Yuan, Joseph and Lin (2007) is algorithmic and 
therefore requires a considerable amount of expertise with the generalized 
LARS code to implement these special needs. However, there is a price to 
be paid for this added flexibility: it is not as fast as the generalized LARS. 

The rest of the paper is organized as follows. We introduce the method- 
ology and study its asymptotic properties in the next section. In Section 3 
we extend the methodology to generalized linear models. Section 4 discusses 
the computational issues involved in the estimation. Simulations and real 
data examples are presented in Sections 5 and 6 to illustrate the proposed 
methods. We conclude with some discussions in Section 7. 

2. Structured variable selection and estimation. The original nonnega- 
tive garrote estimator introduced by Breiman (1995) is a scaled version of 
the least square estimate. Given n independent copies (xi, y±), . . . , (x n , y n ) 
of (X, Y) where A is a p-dimensional covariate and Y is a response variable, 
the shrinkage factor 9(M) = (0i(M), . . . ,9 P (M))' is given as the minimizer 
to 

1 v 
(2.1) -\\Y - Z9\\ 2 , subject to < M and 0j > Vj, 

3=1 

where, with slight abuse of notation, Y = (y±, . . . , y n )' , Z = (zi, . . . , z n )', 
and Zj is a p dimensional vector whose jth element is Xijf3j and /3 LS is 
the least square estimate based on (1.2). Here M > is a tuning parameter. 
The nonnegative garrote estimate of the regression coefficient is subsequently 
defined as f3j(M) = 6j(M)/3^ s , j = 1, . . . ,p. With an appropriately chosen 
tuning parameter M, some of the scaling factors could be estimated by ex- 
act zero and, therefore, the corresponding predictors are eliminated from 
the selected model. 

2.1. Strong heredity principles. Following Yuan, Joseph and Lin (2007), 
let T>i contain the parent effects of the ith. predictor. Under the strong hered- 
ity principle, we need to impose the constraint that f3j ^ for any j G T>i if 
Pi 7^ 0. A naive approach to incorporating effect heredity is therefore to min- 
imize (2.1) under this additional constraint. However, in doing so, we lose 
the convexity of the optimization problem and generally will end up with 
problems such as multiple local optima and potentially NP hardness. Recall 
that the nonnegative garrote estimate of is (3j jS 9i(M). Since /3^ s with 
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Fig. 1. Feasible region of the nonnegative garrote with (right) and without (left) strong 
heredity constraints. 

probability one, Xi will be selected if and only if scaling factor 9^ > 0, in 
which case 9i behaves more or less like an indicator of the inclusion of Xi in 
the selected model. Therefore, the strong heredity principles can be enforced 
by requiring 



Note that if 9i > 0, (2.2) will force the scaling factor for all its parents to be 
positive and consequently active. Since these constraints are linear in terms 
of the scaling factor, minimizing (2.1) under (2.2) remains a quadratic pro- 
gram. Figure 1 illustrates the feasible region of the nonnegative garrote with 
such constraints in contrast with the original nonnegative garrote where no 
heredity rules are enforced. We consider two effects and their interaction 
with the corresponding shrinking factors denoted by 9\, 82 and 812, respec- 
tively. In both situations the feasible region is a convex polyhedron in the 
three dimensional space. 

2.2. Weak heredity principles. Similarly, when considering weak heredity 
principles, we can require that 



However, the feasible region under such constraints is no longer convex as 
demonstrated in the left panel of Figure 2. Subsequently, minimizing (2.1) 
subject to (2.3) is not feasible. To overcome this problem, we suggest using 
the convex envelop of these constraints for the weak heredity principles: 





Vj G Vi. 



(2.3) 



9% < max 9j . 



(2.4) 
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Again, these constraints are linear in terms of the scaling factor and min- 
imizing (2.1) under (2.4) remains a quadratic program. Note that Qi > 
implies that X)jex>i ®i > ^ ana -> therefore, (2.4) will require at least one of its 
parents to be included in the model. In other words, constraint (2.4) can be 
employed in place of (2.3) to enforce the weak heredity principle. The small 
difference between the feasible regions of (2.3) and (2.4) also suggests that 
the selected model may only differ slightly between the two constraints. We 
opt for (2.4) because of the great computational advantage it brings about. 

2.3. Asymptotic properties. To gain further insight to the proposed struc- 
tured variable selection and estimation methods, we study their asymptotic 
properties. We show here that the proposed methods estimate the zero co- 
efficients by zero with probability tending to one, and at the same time give 
root-n consistent estimate to the nonzero coefficients provided that the true 
data generating mechanism satisfies such heredity principles. Denote by A 
the indices of the predictors in the true model, that is, A = {j : (3j ^ 0}. 
Write /3 SVS as the estimate obtained from the proposed structured variable 
selection procedure. 

Under strong heredity, the shrinkage factors # SVS (M) can be equivalently 
written in the Lagrange form [Boyd and Vandenberghe (2004)] 

argrnin^||Y - Z6\\ 2 + A n ^%j , 

subject to 6j > 0, 6j < min^Vj &k for some Lagrange parameter A n > 0. For 
the weak heredity principle, we replace the constraints 8j < min^g^ @k with 




Fig. 2. Feasible region of the nonnegative garrote with constraint 812 < max(#i,#2) (left) 
and the relaxed constraint 812 < 81 + 82 (right). 
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Theorem 1. Assume that X'X/n — > £ and £ is positive definite. If the 
true model satisfies the strong/weak heredity principles, and X n — > oo in a 
fashion such that X n = o(y / n) as n goes to +00, then the structured estimate 
with the corresponding heredity principle satisfies P(/3? vs = 0) — y 1 for any 
j £ A, and /3| vs - 0j = Opin" 1 / 2 ) if j G A. 

All the proofs can be accessed as the supplement materials. Note that 
when A n = 0, there is no penalty and the proposed estimates reduce to 
the least squares estimate which is consistent in estimation. The theorems 
suggest that if instead the tuning parameter A n escapes to infinity at a rate 
slower than y/n, the resulting estimates not only achieve root-n consistency 
in terms of estimation but also are consistent in variable selection, whereas 
the ordinary least squares estimator does not possess such model selection 
ability. 

3. Generalized regression. The nonnegative garrote was originally intro- 
duced for variable selection in multiple linear regression. But the idea can be 
extended to more general regression settings where Y depends on X through 
a scalar parameter rj(X) = (3q + X/3, where (/3o,/3')' is a (p + l)-dimensional 
unknown coefficient vector. It is worth pointing out that such extensions 
have not been proposed in literature so far. 

A common approach to estimating rj is by means of the maximum like- 
lihood. Let £(Y,rf(X)) be a negative log conditional likelihood function of 
Y\X. The maximum likelihood estimate is given as the minimizer of 

n 

L(F,r ? (X)) = ^^,r ? (x,)). 

i=l 

For example, in logistic regression, 

^(y i ,77(x i )) = ^(x i )-log(l + e "W). 

More generally, I can be replaced with any loss functions such that its ex- 
pectation E(£(Y, rj(X))) with respect to the joint distribution of (X, Y) is 
minimized at r/(-). 

To perform variable selection, we propose the following extension of the 
original nonnegative garrote. We use the maximum likelihood estimate /3 MLE 
as a preliminary estimate of (3. Similar to the original nonnegative garrote, 
define Zj = Xj/3j^ LE . Next we estimate the shrinkage factors by 

(3.1) (§o, § SYS Y = argminL(Y, Z6 + O ), 

6o,9 

subject to ^2/9j < M and 9j > for any j = 1, . . . , p. In the case of normal 
linear regression, L becomes the least squares and it is not hard to see 
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that the solution of (3.1) always satisfies Oq = because all variables are 
centered. Therefore, without loss of generality, we could assume that there 
is no intercept in the normal linear regression. The same, however, is not true 
for more general L and, therefore, Oq is included in (3.1). Our final estimate 
of f3j is then given as /3j^ LE #j (M) for j = 1, . . . ,p. To impose the strong 
or weak heredity principle, we add additional constraints Oj < vamkeVj 9k or 
9j < EfeeD, Ok, respectively. 

Theorem 1 can also be extended to more general regression settings. Sim- 
ilar to before, under strong heredity, 

§ SVS (M) = argmin f L(Y, Z9 + O ) + K V Oj ) , 
e °' e \ j=i J 

subject to 9j > 0,9j < minfegu .Ok for some A„ > 0. Under weak heredity 
principles, we use the constraints 0j < X^fceo @k ms tead of 0j < min^g^ &k- 
We shall assume that the following regularity conditions hold: 

(A.l) £(■, •) is a strictly convex function of the second argument; 
(A. 2) the maximum likelihood estimate fi MLE is root-n consistent; 
(A. 3) the observed information matrix converges to a positive definite 
matrix, that is, 

n 

(3.2) i Y, x ^%> ^ MLE + 4 MLE ) S, 

i=i 

where E is a positive definite matrix. 

Theorem 2. Under regularity conditions (A.1)-(A.3), if \ n — > oo in a 
fashion such that X n = o(\fn) as n goes to +oo, then P(/3^ vs = 0) — > 1 for 

any j £ A, and /3? vs — f3j = O p {n~ 1 / 2 ) if j G A provided that the true model 
satisfies the same heredity principles. 

4. Computation. Similar to the original nonnegative garrote, the pro- 
posed structured variable selection and estimation procedure proceeds in 
two steps. First the solution path indexed by the tuning parameter M is 
constructed. The second step, oftentimes referred to as tuning, selects the 
final estimate on the solution path. 

4.1. Linear regression. We begin with linear regression. For both types 
of heredity principles, the shrinkage factors for a given M can be obtained 
from solving a quadratic program of the following form: 

(4.1) minf-HY- Z9\\ 2 \ subject to ^ Qj < M and HO y 0, 
6 V 2 / j=i 
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where H is a m x p matrix determined by the type of heredity principles, is 
a vector of zeros, and ^ means "greater than or equal to" in an element-wise 
manner. Equation (4.1) can be solved efficiently using standard quadratic 
programming techniques, and the solution path of the proposed structured 
variable selection and estimation procedure can be approximated by solving 
(4.1) for a fine grid of M's. 

Recently, Yuan and Lin (2006, 2007) showed that the solution path of the 
original nonnegative garrote is piecewise linear, and used this to construct 
an efficient algorithm for building its whole solution path. The original non- 
negative garrote corresponds to the situation where the matrix H of (4.1) 
is a p x p identity matrix. Similar results can be expected for more general 
scenarios including the proposed procedures, but the algorithm will become 
considerably more complicated and running quadratic programming for a 
grid of tuning parameter tends to be a computationally more efficient alter- 
native. 

Write B LS = diag(/?P, . . . ,/3p S ). The objective function of (4.1) can be 
expressed as 

\\\Y - Z6\\ 2 = \{Y'Y - 26'BX'Y + Q'BX'XBQ). 
Because Y'Y does not depend on 9's, (4.1) is equivalent to 

minf -O'BX'Y + -6'BX'XB9) 

(4.2) 

v 

subject to ^2,6 j < M and H6h0, 

i=i 

which depends on the sample size n only through X'Y and the Gram matrix 
X'X. Both quantities are already computed in evaluating the least squares. 
Therefore, when compared with the ordinary least squares estimator, the 
additional computational cost of the proposed estimating procedures is free 
of sample size n. 

Once the solution path is constructed, our final estimate is chosen on 
the solution path according to certain criterion. Such criterion often reflects 
the prediction accuracy, which depends on the unknown parameters and 
needs to be estimated. A commonly used criterion is the multifold cross 
validation (CV). Multifold CV can be used to estimate the prediction error of 
an estimator. The data C = {(y^Xj) :i = 1, . . . ,n} are first equally split into 
V subsets C\, . . . , Cy. Using the proposed method, and data LS V > = C — C v , 
construct estimate /3^(M). The CV estimate of the prediction error is 
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We select the tuning parameter M by minimizing PE(/3(M)). It is often 
suggested to use V = 10 in practice [Breiman (1995)]. 
It is not hard to see that VE0^(M)) estimates 

PE(/3(M)) = na 2 + n(/3 - /3(M))' E(X' X)((3 - $(M)). 

Since the first term is the inherent prediction error due to the noise, one 
often measures the goodness of an estimator using only the second term, 
referred to as the model error: 

(4.3) ME(/3(M)) = {fi- P(M))'E(X'X)(P - P(M)). 

Clearly, we can estimate the model error as PE(/§(M))/n — a 2 , where a 2 is 
the noise variance estimate obtained from the ordinary least squares estimate 
using all predictors. 

4.2. Generalized regression. Similarly for more general regression set- 
tings, we solve 

v 

(4.4) minL(y, Z9 + ) subject to ^ 9j < M and H9 h 

' i=i 

for some matrix H. This can be done in an iterative fashion provided that the 
loss function L is strictly convex in its second argument. At each iteration, 
denote (9^,9^) the estimate from the previous iteration. We now approxi- 
mate the objective function using a quadratic function around {9^\ 9^) and 
update the estimate by minimizing 

n , 

+ €)W - ^ [ ° ] ) + (°o ~ #)] 

i=X ^ 

+ \i"{m^ + ef )[*{<> - eW) + % - ^i 01 )] 2 ) , 

subject to 0j — M and H9 >z. 0, where the derivatives are taken with 
respect to the second argument of I. Now it becomes a quadratic program. 
We repeat this until a certain convergence criterion is met. 

In choosing the optimal tuning parameter M for general regression, we 
again use the multifold cross-validation. It proceeds in the same fashion as 
before except that we use a loss-dependent cross-validation score: 

^ ^ l(y^S {v) (M) + ^\M)). 

5. Simulations. In this section we investigate the finite sample properties 
of the proposed estimators. To fix ideas, we focus our attention on the usual 
normal linear regression. 
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5.1. Effect of structural constraints. We first consider a couple of mod- 
els that represent different scenarios that may affect the performance of the 
proposed methods. In each of the following models, we consider three ex- 
planatory variables X\ , X2 , X 3 that follow a multivariate normal distribution 
with cov(Xi,Xj) = p\ l ~j\ with three different values for p: 0.5,0 and —0.5. 
A quadratic model with nine terms 

Y = ftXi + p 2 X 2 + P 3 X 3 + PuXl + ^ 2 X X X 2 + • • • + p 33 Xl + e 

is considered. Therefore, we have a total of nine predictors, including three 
main effects, three quadratic terms and three two-way interactions. To demon- 
strate the importance of accounting for potential hierarchical structure among 
the predictor variables, we apply the nonnegative garrote estimator that rec- 
ognizes strong heredity, weak heredity and without heredity constraints. In 
particular, we enforce the strong heredity principle by imposing the following 
constraints: 

#11<#1, #12<#1, #12 < #2, #13<#lj #13 < #3; 

#22 < #2, #23 < #2, #23 < #3> #33 < #3- 

To enforce the weak heredity, we require that 

#11 < #1, #12<#l+#2, #13<#l+#3> 

#22 < #2, #23 < #2 + #3) #33 < #3- 

We consider two data-generating models, one follows the strong heredity 
principles and the other follows the weak heredity principles: 

Model I. The first model follows the strong heredity principle: 

(5.1) F = 3X 1 +2X 2 + 1.5X 1 Jsr 2 +e; 

Model II. The second model is similar to Model I except that the true 
data generating mechanism now follows the weak heredity principle: 

(5.2) Y = 3Xi + 2X1 + 1-5XlX 2 + e. 

For both models, the regression noise e ~ iV(0,3 2 ). 

For each model, 50 independent observations of (Xi,X2,X 3 ,Y) are col- 
lected, and a quadratic model with nine terms is analyzed. We choose the 
tuning parameter by ten-fold cross-validation as described in the last section. 
Following Breiman (1995), we use the model error (4.3) as the gold standard 
in comparing different methods. We repeat the experiment for 1000 times for 
each model and the results are summarized in Table 1. The numbers in the 
parentheses are the standard errors. We can see that the model errors are 
smaller for both weak and strong heredity models compared to the model 
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that does not incorporate any of the heredity principles. Paired t-tests con- 
firmed that most of the observed reductions in model error are significant 
at the 5% level. 

For Model I, the nonnegative garrote that respects the strong heredity 
principles enjoys the best performance, followed by the one with weak hered- 
ity principles. This example demonstrates the benefit of recognizing the ef- 
fect heredity. Note that the model considered here also follows the weak 
heredity principle, which explains why the nonnegative garrote estimator 
with weak heredity outperforms the one that does not enforce any heredity 
constraints. For Model II, the nonnegative garrote with weak heredity per- 
forms the best. Interestingly, the nonnegative garrote with strong heredity 
performs better than the original nonnegative garrote. One possible expla- 
nation is that the reduced feasible region with strong heredity, although 
introducing bias, at the same time makes tuning easier. 

To gain further insight, we look into the model selection ability of the 
structured variable selection. To separate the strength of a method and 
effect of tuning, for each of the simulated data, we check whether or not 
there is any tuning parameter such that the corresponding estimate conforms 
with the true model. The frequency for each method to select the right 
model is given in Table 2, which clearly shows that the proposed structured 
variable selection methods pick the right models more often than the original 
nonnegative garrote. Note that the strong heredity version of the method 
can never pick Model II correctly as it violates the strong heredity principle. 

Table 1 

Model error comparisons. Model I satisfies strong heredity and 
Model II satisfies weak heredity 



No heredity Weak heredity Strong heredity 







Model I 




p = 0.5 


1.79 


1.70 


1.59 




(0.05) 


(0.05) 


(0.04) 


p = 


1.57 


1.56 


1.43 




(0.04) 


(0.04) 


(0.04) 


p= -0.5 


1.78 


1.69 


1.54 




(0.05) 


(0.04) 


(0.04) 






Model II 




p = 0.5 


1.77 


1.61 


1.72 




(0.05) 


(0.05) 


(0.04) 


P = 


1.79 


1.53 


1.70 




(0.05) 


(0.04) 


(0.04) 


p = -0.5 


1.79 


1.68 


1.76 




(0.04) 


(0.04) 


(0.04) 
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Table 2 

Frequency of selecting the right model 







No heredity 


Weak heredity 


Strong heredity 








Model I 




p = 


0.5 


65.5% 


71.5% 


82.0% 


p = 





85.0% 


86.5% 


90.5% 


p= 


-0.5 


66.5% 


73.5% 


81.5% 








Model II 




p = 


0.5 


65.5% 


75.5% 


0.00% 


p = 





83.0% 


90.0% 


0.00% 


p = 


-0.5 


56.5% 


72.5% 


0.00% 




Fig. 3. Effect of the magnitude of the interactions. 



We also want to point out that such comparison, although useful, needs to 
be understood with caution. In practice, no model is perfect and selecting 
an additional main effect X2 so that Model II can satisfy strong heredity 
may be a much more preferable alternative to many. 

We also checked how effective the ten-fold cross-validation is in picking 
the right model when it does not follow any of the heredity principles. We 
generated the data from the model 

Y = 3X 1 + 2Xl + 1-5^| + e, 

where the set up for simulation remains the same as before. Note that this 
model does not follow any of the heredity principles. For each run, we ran the 
nonnegative garrote with weak heredity, strong heredity and no heredity. We 
chose the best among these three estimators using ten-fold cross-validation. 
Note that the three estimators may take different values of the tuning pa- 
rameter. Among 1000 runs, 64.1% of the time, nonnegative garrote with no 
heredity principle was elected. In contrast, for either Model I or Model II 
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with a similar setup, less than 10% of the time nonnegative garrote with no 
heredity principle was elected. This is quite a satisfactory performance. 

5.2. Effect of the size of the interactions. The next example is designed 
to illustrate the effect of the magnitude of the interaction on the proposed 
methods. We use a similar setup as before but now with four main effects 
X\,X2,X^,X4 : , four quadratic terms and six two-way interactions. The true 
data generating mechanism is given by 

(5.3) Y = 3X 1 + 2X 2 + 1.5*3 + Oi(X x X 2 - X X X Z ) + e, 

where a = 1,2,3,4 and e ~ N(0,o~ 2 ) with a 2 chosen so that the signal-to- 
noise ratio is always 3:1. Similar to before, the sample size n = 50. Figure 3 
shows the mean model error estimated over 1000 runs. We can see that 
the strong and weak heredity models perform better than the no heredity 
model and the improvement becomes more significant as the strength of the 
interaction effect increases. 

5.3. Large p. To fix the idea, we have focused on using the least squares 
estimator as our initial estimator. The least squares estimators are known 
to perform poorly when the number of predictors is large when compared 
with the sample size. In particular, it is not applicable when the number 
of predictors exceeds the sample size. However, as shown in Yuan and Lin 
(2007), other initial estimators can also be used. In particular, they suggested 
ridge regression as one of the alternatives to the least squares estimator. To 
demonstrate such an extension, we consider again the regression model (5.3) 
but with ten main effects X\, . . . , *io and ten quadratic terms, as well as 
45 interactions. The total number of effects (p = 65) exceeds the number of 
observations (n = 50) and, therefore, the ridge regression tuned with GCV 
was used as the initial estimator. Figure 4 shows the solution path of the 
nonnegative garrote with strong heredity, weak heredity and without any 
heredity for a typical simulated data with a = 4. 

It is interesting to notice from Figure 4 that the appropriate heredity 
principle, in this case strong heredity, is extremely valuable in distinguishing 
the true effect X% from other spurious effects. This further confirms the 
importance of heredity principles. 

6. Real data examples. In this section we apply the methods from Sec- 
tion 2 to several real data examples. 

6.1. Linear regression example. The first is the prostate data, previously 
used in Tibshirani (1996). The data consist of the medical records of 97 male 
patients who were about to receive a radical prostatectomy. The response 
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Strong Heredity Weak Heredity No Heredity 




FlG. 4. Simulation when p > n: solution for different versions of the nonnegative garrote. 

Strong Heredity Weak Heredity No Heredity 




variable is the level of prostate specific antigen, and there are 8 explanatory 
variables. The explanatory variables are eight clinical measures: log(cancer 
volume) (lcavol), log(prostate weight) (lweight), age, log(benign prostatic 
hyperplasia amount) (lbph), seminal vesicle invasion (svi), log(capsular pen- 
etration) (lcp), Gleason score (gleason) and percentage Gleason scores 4 or 5 
(pgg45). We consider model (1.1) with main effects, quadratic terms and two 
way interactions, which gives us a total of 44 predictors. Figure 5 gives the 
solution path of the nonnegative garrote with strong heredity, weak heredity 
and without any heredity constraints. The vertical grey lines represent the 
models that are selected by the ten-fold cross-validation. 

To determine which type of heredity principle to use for the analysis, we 
calculated the ten-fold cross-validation scores for each method. The cross- 
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validation scores as functions of the tuning parameter M are given in the 
right panel of Figure 6. 

Cross-validation suggests the validity of heredity principles. The strong 
heredity is particularly favored with the smallest score. Using ten-fold cross- 
validation, the original nonnegative garrote that neglects the effect heredity 
chooses a six variable model: Icavol, Iweight, Ibph, gleason 2 , lbph:svi and 
svi:pgg45. Note that this model does not satisfy heredity principle, because 
gleason 2 and svi:pgg45 are included without any of its parent factors. In 
contrast, the nonnegative garrote with strong heredity selects a model with 
seven variables: Icavol, Iweight, Ibph, svi, gleason, gleason 2 and lbph:svi. The 
model selected by the weak heredity, although comparable in terms of cross 
validation score, is considerably bigger with 16 variables. The estimated 
model errors for the strong heredity, weak heredity and no heredity nonneg- 
ative garrote are 0.18, 0.19 and 0.30, respectively, which clearly favors the 
methods that account for the effect heredity. 

To further assess the variability of the ten-fold cross-validation, we also 
ran the leave-one-out cross-validation on the data. The leave-one-out scores 
are given in the right panel of Figure 6. It shows a similar pattern as the 
ten-fold cross-validation. In what follows, we shall continue to use the ten- 
fold cross-validation because of the tremendous computational advantage it 
brings about. 

6.2. Logisitic regression example. To illustrate the strategy in more gen- 
eral regression settings, we consider a logistic regression for the South African 
heart disease data previously used in Hastie, Tibshirani and Friedman (2003). 
The data consist of 9 different measures of 462 subjects and the responses in- 
dicating the presence of heart disease. We again consider a quadratic model. 
There is one binary predictor which leaves a total of 53 terms. Nonnega- 




FlG. 6. Cross-validation scores for the prostate data. 
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Fig. 7. Solution paths for the heart data. 
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M 



Fig. 8. Cross-validation scores for the heart data. 

tive garrote with strong heredity, weak heredity and without heredity were 
applied to the data set. The solution paths are given in Figure 7. 

The cross-validation scores for the three different methods are given in 
Figure 8. As we can see from the figure, nonnegative garrote with strong 
heredity principles achieves the lowest cross-validation score, followed by 
the one without heredity principles. 

6.3. Prediction performance on several benchmark data. To gain further 
insight on the merits of the proposed structured variable selection and esti- 
mation techniques, we apply them to seven benchmark data sets, including 
the previous two examples. The Ozone data, originally used in Breiman 
and Friedman (1985), consist of the daily maximum one- hour-average ozone 
reading and eight meteorological variables in the Los Angeles basin for 330 
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days in 1976. The goal is to predict the daily maximum one-hour-average 
ozone reading using the other eight variables. The Boston housing data 
include statistics for 506 census tracts of Boston from the 1970 census [Har- 
rison and Rubinfeld (1978)]. The problem is to predict the median value of 
owner-occupied homes based on 13 demographic and geological measures. 
The Diabetes data, previously analyzed by Efron et al. (2004), consist of 
eleven clinical measurements for a total of 442 diabetes patients. The goal 
is to predict a quantitative measure of disease progression one year after the 
baseline using the other ten measures that were collected at the baseline. 
Along with the prostate data, these data sets are used to demonstrate our 
methods in the usual normal linear regression setting. 

To illustrate the performance of the structured variable selection and 
estimation in more general regression settings, we include two other logistic 
regression examples along with the South African Heart data. The Pima 
Indians Diabetes data have 392 observations on nine variables. The purpose 
is to predict whether or not a particular subject has diabetes using eight 
remaining variables. The BUPA Liver Disorder data include eight variables 
and the goal is to relate a binary response with seven clinical measurements. 
Both data sets are available from the UCI Repository of machine learning 
databases [Newman et al. (1998)]. 

We consider methods that do not incorporate heredity principles or re- 
spect weak or strong heredity principles. For each method, we estimate the 
prediction error using ten-fold cross-validation, that is, the mean squared 
error in the case of the four linear regression examples, and the misclassifi- 
cation rate in the case of the three classification examples. Table 3 documents 
the findings. Similar to the Heart data we discussed earlier, the total number 
of effects (p) can be different for the same number of main effects (q) due to 
the existence of binary variables. As the results from Table 3 suggest, incor- 
porating the heredity principles leads to improved prediction for all seven 
data sets. Note that for the four regression data sets, the prediction error 
depends on the scale of the response and therefore should not be compared 
across data sets. For example, the response variable of the diabetes data 
ranges from 25 to 346 with a variance of 5943.331. In contrast, the response 
variable of the prostate data ranges from —0.43 to 5.48 with a variance of 
1.46. 

7. Discussions. When a large number of variables are entertained, vari- 
able selection becomes important. With a number of competing models that 
are virtually indistinguishable in fitting the data, it is often advocated to 
select a model with the smaller number of variables. But this principle alone 
may lead to models that are not interpretable. In this paper we proposed 
structured variable selection and estimation methods that can effectively 
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Table 3 

Prediction performance on seven real data sets 



Data 


n 


q 


P 


No heredity 


Weak heredity 


Strong heredity 


Boston 


506 


13 


103 


12.609 


12.403 


12.661 


Diabetes 


442 


10 


64 


3077.471 


2987.447 


3116.989 


Ozone 


203 


9 


54 


16.558 


15.100 


15.397 


Prostate 


97 


8 


44 


0.624 


0.632 


0.584 


BUPA 


345 


6 


27 


0.287 


0.279 


0.267 


Heart 


462 


9 


53 


0.286 


0.275 


0.262 


Pima 


392 


8 


44 


0.199 


0.214 


0.196 



incorporate the hierarchical structure among the predictors in variable se- 
lection and regression coefficient estimation. The proposed methods select 
models that satisfy the heredity principle and are much more interpretable 
in practice. The proposed methods adopt the idea of the nonnegative gar- 
rote and inherit its advantages. They are easy to compute and enjoy good 
theoretical properties. 

Similar to the original nonnegative garrote, the proposed method involves 
the choice of a tuning parameter which also amounts to the selection of a final 
model. Throughout the paper, we have focused on using the cross-validation 
for such a purpose. Other tuning methods could also be used. In particular, 
it is known that prediction-based tuning may result in unnecessarily large 
models. Several heuristic methods are often adopted in practice to alleviate 
such problems. One of the most popular choices is the so-called one standard 
error rule [Breiman et al. (1984)], where instead of choosing the model that 
minimizes the cross-validation score, one chooses the simplest model with 
a cross-validation score within one standard error from the smallest. Our 
experience also suggests that a visual examination of the solution path and 
the cross-validation scores often leads to further insights. 

The proposed method can also be used in other statistical problems when- 
ever the structures among predictors should be respected in model building. 
In some applications, certain predictor variables may be known apriori to 
be more important than the others. This may happen, for example, in time 
series prediction where more recent observations generally should be more 
predictive of future observations. 
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